Enterprise Database Systems
Accessing Data with Spark
Accessing Data with Spark: An Introduction to Spark
Accessing Data with Spark: Data Analysis Using Spark SQL
Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

Accessing Data with Spark: An Introduction to Spark

Course Number:
it_dsadskdj_01_enus
Lesson Objectives

Accessing Data with Spark: An Introduction to Spark

  • recognize where Spark fits in the Hadoop ecosystem and how it relates to Hadoop's components
  • describe Spark RDDs and their characteristics, including what makes them resilient and distributed
  • identify the types of operations that are permitted on an RDD and describe how RDD transformations are lazily evaluated
  • distinguish between RDDs and DataFrames and describe the relationship between the two
  • list the crucial components of Spark and the relationships between them, and recognize the functions of the Spark session and the master and worker nodes
  • install PySpark and initialize a Spark Context (a combined sketch of these hands-on objectives follows this list)
  • create and load data into an RDD
  • initialize a Spark DataFrame from the contents of an RDD
  • work with Spark DataFrames containing both primitive and structured data types
  • define the contents of a DataFrame using the SQLContext
  • apply the map() function on an RDD to configure a DataFrame with column headers
  • retrieve required data from within a DataFrame and define and apply transformations on a DataFrame
  • convert Spark DataFrames to Pandas DataFrames and vice versa
  • describe basic Spark concepts
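
The hands-on objectives above can be pictured with a minimal PySpark sketch; the sample records and column names below are illustrative, not course material:

```python
from pyspark.sql import Row, SparkSession

# SparkSession is the single entry point in Spark 2.x and later;
# the SparkContext is available through it.
spark = SparkSession.builder.appName("spark-intro").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection.
rdd = sc.parallelize([("alice", 34), ("bob", 28), ("carol", 41)])

# Apply map() to attach column headers, then build a DataFrame.
rows = rdd.map(lambda t: Row(name=t[0], age=t[1]))
df = spark.createDataFrame(rows)

# Transformations are lazy; show() is an action that triggers them.
df.filter(df.age > 30).show()

# Convert to a Pandas DataFrame and back again.
pdf = df.toPandas()
df2 = spark.createDataFrame(pdf)
```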

Overview/Description

Explore the basics of Apache Spark, an analytics engine for working with big data that can run on top of Hadoop. Discover how it allows operations on data through both its own library methods and SQL, while its in-memory processing model delivers strong performance.
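
As a taste of the two interfaces, the same question can be asked with a DataFrame method chain or with SQL; this reuses the hypothetical `spark` and `df` from the sketch above:

```python
# DataFrame method chain:
df.groupBy("name").count().show()

# Equivalent SQL, after exposing the DataFrame as a temporary view:
df.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()
```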



Target Audience

Prerequisites: none

Accessing Data with Spark: Data Analysis Using Spark SQL

Course Number:
it_dsadskdj_03_enus
Lesson Objectives

Accessing Data with Spark: Data Analysis Using Spark SQL

  • recall the different stages involved in optimizing any query or method call on the contents of a Spark DataFrame
  • create views out of a Spark DataFrame's contents and run queries against them (see the sketch after this list)
  • trim and clean a DataFrame before a view is created as a precursor to running SQL queries on it
  • perform an analysis of data by running different kinds of SQL queries, including grouping and aggregations
  • recognize how Spark DataFrames infer the schema of data loaded into them and configure a DataFrame with an explicitly defined schema
  • define what a window is in the context of Spark DataFrames and when one can be used
  • create and analyze categories of data in a dataset using windows
  • analyze data using Spark SQL
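
A combined sketch of the view, explicit-schema, and SQL-query objectives above; the file name, columns, and types are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("spark-sql").getOrCreate()

# Supply an explicit schema rather than letting Spark infer one.
schema = StructType([
    StructField("region", StringType(), True),
    StructField("product", StringType(), True),
    StructField("revenue", DoubleType(), True),
])
df = spark.read.csv("sales.csv", header=True, schema=schema)

# Trim and clean the DataFrame before exposing it as a view.
clean = df.dropna(subset=["region", "revenue"])
clean.createOrReplaceTempView("sales")

# Grouping and aggregation through plain SQL.
spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
""").show()
```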

Overview/Description

Analyze a Spark DataFrame by treating it as though it were a relational database table. Discover how to create a view from a Spark DataFrame and run SQL queries against it, and how to define windows and use them to explore data.
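
The windows referred to here are SQL-style window functions, not an operating system; a minimal sketch, reusing the hypothetical `clean` sales DataFrame from the sketch above:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# Rank products by revenue within each region.
w = Window.partitionBy("region").orderBy(col("revenue").desc())
(clean.withColumn("rank_in_region", rank().over(w))
      .filter(col("rank_in_region") <= 3)
      .show())
```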



Target Audience

Prerequisites: none

Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

Course Number:
it_dsadskdj_02_enus
Lesson Objectives

Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

  • recognize the features that make Spark 2.x versions significantly faster than Spark 1.x
  • specify the reasons for using shared variables in your Spark application and distinguish between the two options available for sharing variables: broadcast variables and accumulators
  • create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame (see the sketch after this list)
  • define a transformation to view a random sample of data from a large DataFrame
  • apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset
  • use Matplotlib to visualize the contents of a Spark DataFrame
  • perform operations to prepare your dataset for analysis by trimming unnecessary columns and rows containing missing data
  • define and apply a generic transformation on a DataFrame
  • apply complex transformations on a DataFrame to extract meaningful information from a dataset
  • work with broadcast variables and perform a join operation with a DataFrame that has been broadcast
  • use a Spark accumulator as a counter
  • store the contents of a DataFrame in a text file for archiving or sharing
  • define and work with a custom accumulator to count a vector of values
  • perform different join operations on Spark DataFrames to combine data from multiple sources
  • analyze data using the DataFrame API
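
A sketch of the loading, cleaning, sampling, and aggregation objectives above; the file and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("df-api").getOrCreate()

# Load a CSV and let Spark infer the schema.
df = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Trim unnecessary columns and drop rows with missing data.
df = df.select("user_id", "genre", "rating").dropna()

# View a random ~1% sample of a large DataFrame.
df.sample(withReplacement=False, fraction=0.01, seed=42).show()

# Grouping and aggregation on a category column.
df.groupBy("genre") \
  .agg(avg("rating").alias("avg_rating"), count("*").alias("n_ratings")) \
  .show()
```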

Overview/Description

Explore how to analyze real datasets using Spark DataFrame API methods. Discover how to optimize operations using shared variables and how to combine data from multiple DataFrames using joins.
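
The shared variables are broadcast variables and accumulators; a sketch of both, plus a broadcast join and an archival write, reusing the hypothetical `spark` and `df` from the sketch above:

```python
from pyspark.sql.functions import broadcast

# A small lookup table, broadcast so the join avoids a shuffle.
genres = spark.createDataFrame(
    [("scifi", "Science Fiction"), ("doc", "Documentary")],
    ["genre", "genre_name"],
)
joined = df.join(broadcast(genres), on="genre", how="left")

# An accumulator used as a counter across the cluster.
unmatched = spark.sparkContext.accumulator(0)

def count_unmatched(row):
    # Rows whose genre was absent from the lookup table.
    if row["genre_name"] is None:
        unmatched.add(1)

joined.foreach(count_unmatched)
print("rows with no matching genre:", unmatched.value)

# Store the result as text (CSV) files for archiving or sharing.
joined.write.mode("overwrite").csv("joined_out", header=True)
```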



Target Audience

Prerequisites: none
